tensor rank
BeyondtheSigns: NonparametricTensor CompletionviaSignSeries
A nonparametric approach to tensor completion is developed based on anewmodel which we coin assign representable tensors. The model represents the signal tensor of interest using a series of structured sign tensors. Unlike earlier methods, the sign series representation effectively addresses both low-andhigh-rank signals, while encompassing manyexisting tensor models-- includingCPmodels,Tuckermodels,singleindexmodels,structuredtensorswith repeating entries--as special cases. We provably reduce the tensor estimation problem to a series of structured classification tasks, and we develop a learning reduction machinery to empower existing low-rank tensor algorithms for more challenging high-rank estimation.
Secret mixtures of experts inside your LLM
Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper seeks to understand the MLP layers in dense LLM models by hypothesizing that these layers secretly approximately perform a sparse computation -- namely, that they can be well approximated by sparsely-activating Mixture of Experts (MoE) layers. Our hypothesis is based on a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space. We empirically validate the hypothesis on pretrained LLMs, and demonstrate that the activation distribution matters -- these results do not hold for Gaussian data, but rather rely crucially on structure in the distribution of neural network activations. Our results shine light on a general principle at play in MLP layers inside LLMs, and give an explanation for the effectiveness of modern MoE-based transformers. Additionally, our experimental explorations suggest new directions for more efficient MoE architecture design based on low-rank routers.
Learning words in groups: fusion algebras, tensor ranks and grokking
Shutman, Maor, Louidor, Oren, Tessler, Ran
In this work, we demonstrate that a simple two-layer neural network with standard activation functions can learn an arbitrary word operation in any finite group, provided sufficient width is available and exhibits grokking while doing so. To explain the mechanism by which this is achieved, we reframe the problem as that of learning a particular $3$-tensor, which we show is typically of low rank. A key insight is that low-rank implementations of this tensor can be obtained by decomposing it along triplets of basic self-conjugate representations of the group and leveraging the fusion structure to rule out many components. Focusing on a phenomenologically similar but more tractable surrogate model, we show that the network is able to find such low-rank implementations (or approximations thereof), thereby using limited width to approximate the word-tensor in a generalizable way. In the case of the simple multiplication word, we further elucidate the form of these low-rank implementations, showing that the network effectively implements efficient matrix multiplication in the sense of Strassen. Our work also sheds light on the mechanism by which a network reaches such a solution under gradient descent.
Appendix for " Beyond the Signs: Nonparametric Tensor Completion via Sign Series "
See Section B.2 for constructive examples.Proof of Proposition 2. Based on (3) in Proposition 2, we have Risk( Z) Risk( Θ) = E null |sgnZ sgn Θ|| Θ|null . We divide the proof into two cases: α > 0 and α = . The inequality (6) now becomes Risk( Z) Risk( Θ) t null MAE(sgn Θ, sgnZ) C snull, for all 0 t < ρ(π, N) . Consider the same setup as in Theorem 2. Fix The conclusion (10) then directly follows by applying Remark A.1 to (11). 3 Proof of Theorem 2. To simplify the notation, we denote ρ = ρ(π, N). It follows from Kosorok (2007, Theorem 9.22) that the Proof of Theorem 3. By definition of ˆ Θ, we have MAE( ˆ Θ, Θ) = E null null null null null 1 2H + 1 null Assumption A.1, we establish the estimation accuracy guarantee for the large-margin estimators H log H. (29) In particualr, setting H null (1 + |N|) To apply Theorem A.1, we choose the pair ( L Here, we describe the details of the example set-up.